Learning


Imagine a machine or organism that experiences over its 
lifetime a series of sensory inputs: 

$x_1, x_2, x_3, x_4, \ldots$

Supervised learning: The machine is also given desired
outputs $y_1, y_2, \ldots$, and its goal is to learn to produce the
correct output given a new input.


Unsupervised learning: The goal of the machine is to 
build representations from x that can be used for 
reasoning, decision making, predicting things, 
communicating etc. 


Reinforcement learning: The machine can also
produce actions $a_1, a_2, \ldots$ which affect the state of the
world, and receives rewards (or punishments) $r_1, r_2, \ldots$.
Its goal is to learn to act in a way that maximises rewards
in the long term.







Goals of Unsupervised Learning 


To find useful representations of the data, for example 

• finding clusters, e.g. k-means, ART 

• dimensionality reduction, e.g. PCA, Hebbian 
learning, multidimensional scaling (MDS) 

• building topographic maps, e.g. elastic networks, 
Kohonen maps 

• finding the hidden causes or sources of the data 

• modeling the data density 

We can quantify what we mean by “useful” later. 



Uses of Unsupervised Learning 


data compression 
outlier detection 
classification 

make other learning tasks easier 
a theory of human learning and perception 



Probabilistic Models 


A probabilistic model of sensory inputs can: 

- make optimal decisions under a given loss 
function 

- make inferences about missing inputs 

- generate predictions/fantasies/imagery 

- communicate the data in an efficient way 

Probabilistic modeling is equivalent to other views of 
learning: 

- information theoretic: 

finding compact representations of the data 

- physical analogies: minimising free energy of a 
corresponding statistical mechanical system 



Bayes rule 


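$\mathcal{D}$: data set

$M$: models (or parameters)


The probability of a model $M$ given data set $\mathcal{D}$ is:

$$P(M \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid M)\, P(M)}{P(\mathcal{D})}$$

$P(\mathcal{D} \mid M)$ is the evidence (or likelihood)

$P(M)$ is the prior probability of $M$
$P(M \mid \mathcal{D})$ is the posterior probability of $M$

$$P(\mathcal{D}) = \int P(\mathcal{D} \mid M)\, P(M)\, dM$$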

Under very weak and reasonable assumptions, Bayes 
rule is the only rational and consistent way to manipulate 
uncertainties/beliefs (Polya, Cox axioms, etc). 



Bayes, MAP and ML 


Bayesian Learning:

Assumes a prior over the model parameters. Computes
the posterior distribution of the parameters: $P(\theta \mid \mathcal{D})$.


Maximum a Posteriori (MAP) Learning:

Assumes a prior over the model parameters $P(\theta)$.
Finds a parameter setting that maximises the posterior:
$P(\theta \mid \mathcal{D}) \propto P(\theta)\, P(\mathcal{D} \mid \theta)$.


Maximum Likelihood (ML) Learning:

Does not assume a prior over the model parameters.
Finds a parameter setting that maximises the likelihood
of the data: $P(\mathcal{D} \mid \theta)$.



Modeling Correlations 



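Consider a set of variables $y_1, \ldots, y_D$.

A very simple model:
means $\mu_i = \langle y_i \rangle$ and
correlations $\Sigma_{ij} = \langle y_i y_j \rangle - \langle y_i \rangle \langle y_j \rangle$

This corresponds to fitting a Gaussian to the data

$$P(\mathbf{y}) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right\}$$

There are $D(D+1)/2$ parameters in the covariance of this model.
What if $D$ is large?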




Factor Analysis 



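Linear generative model:

$$y_d = \sum_{k=1}^{K} \Lambda_{dk}\, x_k + \epsilon_d$$

• $x_k$ are independent $\mathcal{N}(0, 1)$ Gaussian factors

• $\epsilon_d$ are independent $\mathcal{N}(0, \Psi_{dd})$ Gaussian noise

• $K < D$

So, $\mathbf{y}$ is Gaussian with:

$$P(\mathbf{y}) = \int P(\mathbf{x})\, P(\mathbf{y} \mid \mathbf{x})\, d\mathbf{x} = \mathcal{N}(0,\; \Lambda\Lambda^\top + \Psi)$$

where $\Lambda$ is a $D \times K$ matrix, and $\Psi$ is diagonal.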

Dimensionality Reduction: Finds a low-dimensional 
projection of high dimensional data that captures most of 
the correlation structure of the data. 


Factor Analysis: Notes 



ML learning finds $\Lambda$ and $\Psi$ given data

parameters (with correction from symmetries):

$$DK + D - \frac{K(K-1)}{2} < \frac{D(D+1)}{2}$$

no closed form solution for ML params


Bayesian treatment would integrate over all $\Lambda$ and $\Psi$
and would find posterior on number of factors;
however it is intractable.




Network Interpretations



[Figure: network with a decoder ("generation") and an encoder ("recognition")]


• autoencoder neural network 

• if trained to minimise MSE, then we get PCA 

• if MSE + output noise, we get PPCA 

• if MSE + output noises + reg. penalty, we get FA 






Graphical Models 


A directed acyclic graph (DAG) in which each node 
corresponds to a random variable. 




[Figure: example DAG over $x_1, \ldots, x_5$, whose joint distribution includes the factors]

$$P(x_3 \mid x_1, x_2)\; P(x_4 \mid x_2)\; P(x_5 \mid x_3, x_4)$$


Definitions: children, parents, descendants, ancestors
Key quantity: joint probability distribution over nodes.


$$P(\{x_1, x_2, \ldots, x_n\}) = P(\mathbf{X})$$


(1) The graph specifies a factorization of this joint pdf:


$$P(\mathbf{X}) = \prod_i P(x_i \mid \mathrm{pa}(x_i))$$


(2) Each node stores a conditional distribution over its
own value given the values of its parents.

(1) & (2) completely specify the joint pdf numerically.

Semantics: Given its parents, each node is
conditionally independent from its non-descendants.


(Also known as Bayesian Networks, Belief Networks, 
Probabilistic Independence Networks.) 
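
As a rough illustration of points (1) and (2), the sketch below builds a joint distribution over five binary nodes from local conditional tables. The parent sets of x3, x4 and x5 follow the factors P(x3|x1,x2), P(x4|x2) and P(x5|x3,x4); treating x1 and x2 as parentless roots, and every numerical table value, are assumptions made only for this example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Hypothetical conditional table of shape (2,)*n_parents + (2,),
    normalised over the last axis (the node's own value)."""
    t = rng.random((2,) * n_parents + (2,))
    return t / t.sum(axis=-1, keepdims=True)

p1 = np.array([0.6, 0.4])   # P(x1): assumed root node
p2 = np.array([0.7, 0.3])   # P(x2): assumed root node
p3 = random_cpt(2)          # P(x3 | x1, x2)
p4 = random_cpt(1)          # P(x4 | x2)
p5 = random_cpt(2)          # P(x5 | x3, x4)

def joint(x1, x2, x3, x4, x5):
    """Joint probability as the product of each node's local conditional."""
    return p1[x1] * p2[x2] * p3[x1, x2, x3] * p4[x2, x4] * p5[x3, x4, x5]

states = list(itertools.product([0, 1], repeat=5))
total = sum(joint(*s) for s in states)

def prob(fixed):
    """Marginal probability of a partial assignment {node_index: value}."""
    return sum(joint(*s) for s in states
               if all(s[i] == v for i, v in fixed.items()))

# P(x4 = 1 | x1, x2 = 0) for both values of x1
cond = [prob({0: a, 1: 0, 3: 1}) / prob({0: a, 1: 0}) for a in (0, 1)]
```

Summing the product over all 32 states gives 1, and P(x4 | x1, x2) comes out the same for both values of x1, which is the conditional-independence semantics stated above.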



Two Unknown Quantities 


In general, two quantities in the graph may be unknown: 

• parameter values in the distributions $P(x_i \mid \mathrm{pa}(x_i))$

• hidden (unobserved) variables not present in the data 

Assume you knew one of these: 

• Known hidden variables, unknown parameters 
=> this is complete data learning (decoupled 
problems) 



• Known parameters, unknown hidden variables
=> this is called inference (often the crux)



But what if both were unknown simultaneously... 




Learning with Hidden Variables: 
The EM Algorithm 



Assume a model parameterised by $\theta$ with observable
variables $Y$ and hidden variables $X$

Goal: maximise log likelihood of observables.

$$\mathcal{L}(\theta) = \ln P(Y \mid \theta) = \ln \sum_X P(Y, X \mid \theta)$$

• E-step: first infer $P(X \mid Y, \theta_{\mathrm{old}})$, then

• M-step: find $\theta_{\mathrm{new}}$ using complete data learning

The E-step requires solving the inference problem:
finding explanations, $X$, for the data, $Y$,
given the current model $\theta$.



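EM algorithm & F-function



Any distribution $Q(X)$ over the hidden variables defines a
lower bound on $\ln P(Y \mid \theta)$ called $\mathcal{F}(Q, \theta)$:

$$\ln P(Y \mid \theta) = \ln \sum_X P(X, Y \mid \theta) = \ln \sum_X Q(X) \frac{P(X, Y \mid \theta)}{Q(X)}$$

$$\geq \sum_X Q(X) \ln \frac{P(X, Y \mid \theta)}{Q(X)} = \mathcal{F}(Q, \theta)$$

E-step: Maximise $\mathcal{F}$ w.r.t. $Q$ with $\theta$ fixed

$$Q^*(X) = P(X \mid Y, \theta)$$

M-step: Maximise $\mathcal{F}$ w.r.t. $\theta$ with $Q$ fixed

$$\theta^* = \arg\max_\theta \sum_X Q^*(X) \ln P(X, Y \mid \theta)$$

NB: max of $\mathcal{F}(Q, \theta)$ is max of $\ln P(Y \mid \theta)$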
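
The bound can be checked numerically on a toy problem. The sketch below assumes a hidden variable X with four states and made-up values for the joint P(X, Y = y_obs | theta); it compares F(Q, theta) for an arbitrary Q and for the exact posterior against ln P(Y | theta). All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values of P(X = x, Y = y_obs | theta) for the 4 states of X
joint = rng.random(4) * 0.1

def free_energy(Q, joint):
    """F(Q, theta) = sum_X Q(X) ln [ P(X, Y | theta) / Q(X) ]."""
    return float(np.sum(Q * (np.log(joint) - np.log(Q))))

log_lik = float(np.log(joint.sum()))    # ln P(Y | theta)
posterior = joint / joint.sum()         # P(X | Y, theta)
Q_rand = rng.dirichlet(np.ones(4))      # an arbitrary distribution over X

F_rand = free_energy(Q_rand, joint)     # lies below log_lik
F_post = free_energy(posterior, joint)  # touches log_lik (E-step optimum)
```

Setting Q to the posterior makes the bound tight, which is why the E-step choice Q*(X) = P(X | Y, theta) maximises F for fixed theta.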





Two Intuitions about EM 


I. EM decouples the parameters 



The E-step “fills in” values for the hidden variables.
With no hidden variables, the likelihood is a simpler
function of the parameters. The M-step for the parameters
at each node can be computed independently, and depends
only on the values of the variables at that node and its
parents.


II. EM is coordinate ascent in $\mathcal{F}$












EM for Factor Analysis 



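$$\mathcal{F}(Q, \theta) = \int Q(\mathbf{x}) \ln P(\mathbf{x}, \mathbf{y} \mid \theta)\, d\mathbf{x} - \int Q(\mathbf{x}) \ln Q(\mathbf{x})\, d\mathbf{x}$$

E-step: Maximise $\mathcal{F}$ w.r.t. $Q$ with $\theta$ fixed

$$Q^*(\mathbf{x}) = P(\mathbf{x} \mid \mathbf{y}, \theta) = \mathcal{N}(\beta \mathbf{y},\; I - \beta\Lambda)$$

$$\beta = \Lambda^\top (\Lambda\Lambda^\top + \Psi)^{-1}$$

M-step: Maximise $\mathcal{F}$ w.r.t. $\theta$ with $Q$ fixed:


$$\ln P(\mathbf{x}, \mathbf{y} \mid \theta) = -\tfrac{1}{2} \left( \mathbf{x}^\top \mathbf{x} + (\mathbf{y} - \Lambda\mathbf{x})^\top \Psi^{-1} (\mathbf{y} - \Lambda\mathbf{x}) + \ln|\Psi| \right) + c$$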


• The E-step reduces to computing the Gaussian 
posterior distribution over the hidden variables. 

• The M-step reduces to solving a weighted linear 
regression problem. 
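
A minimal sketch of these two steps in NumPy, assuming zero-mean data stored row-wise in an (N, D) array. The function names, the random initialisation and the fixed iteration count are choices made for this example, not part of the original slides. The M-step uses the standard regression-style updates for the loading matrix (Lambda) and the diagonal noise variances (Psi).

```python
import numpy as np

def fa_log_lik(S, N, Lam, psi):
    """Log likelihood of N zero-mean points with sample covariance S
    under y ~ N(0, Lam Lam^T + diag(psi))."""
    D = S.shape[0]
    Cov = Lam @ Lam.T + np.diag(psi)
    return -0.5 * N * (D * np.log(2 * np.pi) + np.linalg.slogdet(Cov)[1]
                       + np.trace(np.linalg.solve(Cov, S)))

def fa_em(Y, K, n_iter=100, seed=0):
    """EM for factor analysis; returns loadings, noise variances and
    the log likelihood recorded before each iteration and at the end."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    S = Y.T @ Y / N
    Lam = rng.standard_normal((D, K))
    psi = np.diag(S).copy()
    history = []
    for _ in range(n_iter):
        history.append(fa_log_lik(S, N, Lam, psi))
        # E-step: Q*(x) = N(beta y, I - beta Lam)
        beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(psi))  # (K, D)
        Ex = Y @ beta.T                                           # posterior means
        Exx = N * (np.eye(K) - beta @ Lam) + Ex.T @ Ex            # sum_n <x x^T>
        # M-step: weighted linear regression of y on the inferred x
        Lam = (Y.T @ Ex) @ np.linalg.inv(Exx)
        psi = np.diag(S - Lam @ (Ex.T @ Y) / N).copy()
    history.append(fa_log_lik(S, N, Lam, psi))
    return Lam, psi, history
```

The recorded log likelihood should never decrease from one iteration to the next, which gives a simple sanity check on an implementation like this.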


Inference in Graphical Models 



Singly connected nets 

The belief propagation 
algorithm. 



Multiply connected nets 

The junction tree algorithm. 


These are efficient ways of applying Bayes rule using the 
conditional independence relationships implied by the 
graphical model. 




How Factor Analysis is 
Related to Other Models 

Principal Components Analysis (PCA): Assume
no noise on the observations: $\Psi = \lim_{\epsilon \to 0} \epsilon I$

Independent Components Analysis (ICA): Assume
the factors are non-Gaussian (and no noise).

Mixture of Gaussians: A single discrete-valued
factor: $x_k = 1$ and $x_j = 0$ for all $j \neq k$

Mixture of Factor Analysers: Assume the data has 
several clusters, each of which is modeled by a 
single factor analyser. 

Linear Dynamical Systems: Time series model in 
which the factor at time t depends linearly on the 
factor at time t — 1, with Gaussian noise. 



A Generative Model for Generative Models 



















Mixture of Gaussians and K-Means 


Goal: finding clusters in data. 

To generate data from this model, assuming K clusters: 

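• Pick cluster $k \in \{1, \ldots, K\}$ with probability $\pi_k$

• Generate data according to a Gaussian with
mean $\mu_k$ and covariance $\Sigma_k$.

$$p(\mathbf{y}) = \sum_{k=1}^{K} P(x = k)\, P(\mathbf{y} \mid x = k) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{y} \mid \mu_k, \Sigma_k)$$


E-step: Compute responsibilities for each data vector $\mathbf{y}^{(i)}$


$$r_{ki} = P(x = k \mid \mathbf{y}^{(i)}) = \frac{\pi_k\, \mathcal{N}(\mathbf{y}^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'}\, \mathcal{N}(\mathbf{y}^{(i)} \mid \mu_{k'}, \Sigma_{k'})}$$


M-step: Estimate $\pi_k$, $\mu_k$ and $\Sigma_k$ using data weighted by
the responsibilities.

The k-means algorithm for clustering is a special case of
EM for mixture of Gaussians where $\Sigma_k = \lim_{\epsilon \to 0} \epsilon I$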
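
The E- and M-steps for a mixture of Gaussians can be sketched as follows. This is an illustrative implementation: the function names, the identity initial covariances, the small ridge added to each covariance and the caller-supplied initial means are all assumptions of the example.

```python
import numpy as np

def gauss_pdf(Y, mu, Sigma):
    """Density N(y | mu, Sigma) evaluated at each row y of Y."""
    D = Y.shape[1]
    diff = Y - mu
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    log_norm = D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
    return np.exp(-0.5 * (quad + log_norm))

def mog_em(Y, mu, n_iter=30):
    """EM for a mixture of Gaussians; mu holds K initial means, shape (K, D)."""
    N, D = Y.shape
    K = mu.shape[0]
    mu = mu.astype(float).copy()
    pi = np.full(K, 1.0 / K)
    Sigma = np.stack([np.eye(D)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities r_ki, one column per data vector
        R = np.stack([pi[k] * gauss_pdf(Y, mu[k], Sigma[k]) for k in range(K)])
        R /= R.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        Nk = R.sum(axis=1)
        pi = Nk / N
        mu = (R @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = (R[k][:, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, R
```

Replacing the soft responsibilities with a hard assignment to the nearest mean, and fixing every covariance to the same shrinking multiple of the identity, gives the k-means special case mentioned above.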




Mixture of Factor Analysers 


Assumes the model has several clusters 
(indexed by a discrete hidden variable x). 

Each cluster is modeled by a factor analyser:

$$P(\mathbf{y}) = \sum_{m=1}^{M} P(x = m)\, P(\mathbf{y} \mid x = m)$$

where

$$P(\mathbf{y} \mid x = m) = \mathcal{N}(\mu_m,\; \Lambda_m \Lambda_m^\top + \Psi)$$

• it’s a way of fitting a mixture of Gaussians 
to high-dimensional data 

• clustering and dimensionality reduction 

• Bayesian learning can infer a posterior over the 
number of clusters and their intrinsic 
dimensionalities. 


Independent Components Analysis 



$P(x_k)$ is non-Gaussian.

Equivalently $P(x_k)$ is Gaussian and

$$y_d = \sum_{k=1}^{K} \Lambda_{dk}\, g(x_k) + \epsilon_d$$

where $g(\cdot)$ is a nonlinearity.

For K = D, and observation noise assumed to be 
zero, inference and learning are easy (standard ICA). 
Many extensions possible (e.g. with noise => IFA). 


Hidden Markov Models/Linear Dynamical Systems 




• Hidden states $\{x_t\}$, outputs $\{y_t\}$

Joint probability factorises:

$$P(\{x\}, \{y\}) = \prod_{t=1}^{\tau} P(x_t \mid x_{t-1})\, P(y_t \mid x_t)$$


• you can think of this as:

Markov chain with stochastic measurements.
Gauss-Markov process in a pancake.



or 

Mixture model with states coupled across time. 
Factor analysis through time. 
























HMM Generative Model 




plain-vanilla HMM = 

“probabilistic function of a Markov chain”: 

1. Use a 1st-order Markov chain to generate a
hidden state sequence (path):

$$P(x_1 = j) = \pi_j$$
$$P(x_{t+1} = j \mid x_t = i) = T_{ij}$$

2. Use a set of output prob. distributions $A_j(\cdot)$ (one
per state) to convert this state path into a
sequence of observable symbols or vectors

$$P(y_t = y \mid x_t = j) = A_j(y)$$


Notes:

- Even though hidden state seq. is 1st-order Markov, the
output process is not Markov of any order

[ex. 1111121111311121111131 ... ] 

- Discrete state, discrete output models can approximate any 
continuous dynamics and observation mapping even if 
nonlinear; however lose ability to interpolate 
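
The two generative steps can be sketched for a discrete-output HMM as below. The function name, the zero-based integer coding of states and symbols, and the example parameter values are assumptions of this illustration.

```python
import numpy as np

def sample_hmm(pi, T, A, tau, seed=0):
    """Sample a hidden path x and outputs y of length tau.
    pi[j]   = P(x_1 = j)
    T[i, j] = P(x_{t+1} = j | x_t = i)
    A[j, y] = P(y_t = y | x_t = j)"""
    rng = np.random.default_rng(seed)
    S, V = A.shape
    x = np.zeros(tau, dtype=int)
    y = np.zeros(tau, dtype=int)
    x[0] = rng.choice(S, p=pi)                 # step 1: Markov chain on states
    y[0] = rng.choice(V, p=A[x[0]])            # step 2: emit from state's distribution
    for t in range(1, tau):
        x[t] = rng.choice(S, p=T[x[t - 1]])
        y[t] = rng.choice(V, p=A[x[t]])
    return x, y

# Hypothetical two-state, three-symbol model
pi = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1], [0.2, 0.8]])
A = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
x, y = sample_hmm(pi, T, A, tau=50)
```

Only y would be observed in practice; the path x is what the E-step of EM has to infer.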


LDS Generative Model 




• Gauss-Markov continuous state process: 

$$x_{t+1} = A x_t + w_t$$

observed through the “lens” of a
noisy linear embedding:

$$y_t = C x_t + v_t$$

• Noises $w_t$ and $v_t$ are temporally white and
uncorrelated with everything else

• Think of this as “matrix flow in a pancake”



(Also called state-space models, Kalman filter models.) 


EM applied to HMMs and LDSs 



Given a sequence of $\tau$ observations $\{y_1, \ldots, y_\tau\}$

E-step. Compute the posterior probabilities:

• HMM: Forward-backward algorithm: $P(\{x\} \mid \{y\})$

• LDS: Kalman smoothing recursions: $P(\{x\} \mid \{y\})$

M-step. Re-estimate parameters: 

• HMM: Count expected frequencies. 

• LDS: Weighted linear regression. 

Notes: 

1. forward-backward and Kalman smoothing recursions 
are special cases of belief propagation. 

2. online (causal) inference $P(x_t \mid \{y_1, \ldots, y_t\})$ is done
by the forward algorithm or the Kalman filter.

3. what sets the (arbitrary) scale of the hidden state? 
Scale of Q (usually fixed at I). 




Trees/Chains 



Tree-structured = each node has at most one parent (the root has none).

Discrete nodes or linear-Gaussian. 

Hybrid systems are possible: mixed discrete & 
continuous nodes. But, to remain tractable, discrete 
nodes must have discrete parents. 

Exact & efficient inference is done by belief 
propagation (generalised Kalman Smoothing). 

Can capture multiscale structure (e.g. images) 



Polytrees/Layered Networks 



• more complex models for which junction-tree 
algorithm would be needed to do exact inference 

• discrete/linear-Gaussian nodes are possible 

• case of binary units is widely studied: 

Sigmoid Belief Networks 


• but usually intractable 


Intractability 


For many probabilistic models of interest, exact inference 
is not computationally feasible. 

This occurs for two (main) reasons: 

• distributions may have complicated forms 
(non-linearities in generative model) 

• “explaining away” causes coupling from observations:
observing the value of a child induces dependencies
amongst its parents (high order interactions)



We can still work with such models by using approximate 
inference techniques to estimate the latent variables. 



Approximate Inference 



Sampling: 

approximate true distribution over hidden variables 
with a few well chosen samples at certain values 


Linearization: 

approximate the transformation on the hidden 
variables by one which keeps the form of the 
distribution closed (e.g. Gaussians and linear) 



Recognition Models: 

approximate the true distribution with an 
approximation that can be computed easily/quickly 
by an explicit bottom-up inference model/network 



Variational Methods: 

approximate the true distribution with an approximate 
form that is tractable; maximise a lower bound on the 
likelihood with respect to free parameters in this form 












Sampling 


Gibbs Sampling 

To sample from a joint distribution $P(x_1, x_2, \ldots, x_N)$:

Start from some initial state $\mathbf{x}^0 = (x_1^0, x_2^0, \ldots, x_N^0)$.
Then iterate the following procedure:

• Pick $x_1^{k+1}$ from $P(x_1 \mid x_2^k, x_3^k, \ldots, x_N^k)$

• Pick $x_2^{k+1}$ from $P(x_2 \mid x_1^{k+1}, x_3^k, \ldots, x_N^k)$

$\vdots$

• Pick $x_N^{k+1}$ from $P(x_N \mid x_1^{k+1}, x_2^{k+1}, \ldots, x_{N-1}^{k+1})$

This procedure goes from $\mathbf{x}^k \to \mathbf{x}^{k+1}$, creating a Markov
chain which converges to $P(\mathbf{x})$

Gibbs sampling can be used to estimate the expectations 
under the posterior distribution needed for E-step of EM. 

It is just one of many Markov chain Monte Carlo (MCMC) 
methods. Easy to use if you can easily update subsets of 
latent variables at a time. 

Key questions: how many iterations per sample? 
how many samples? 
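
As a small worked case, the sketch below runs Gibbs sampling on a zero-mean, unit-variance bivariate Gaussian with correlation r, where each conditional is itself Gaussian: x1 given x2 is N(r x2, 1 - r^2), and symmetrically for x2. The target distribution, function name and burn-in length are assumptions chosen for illustration.

```python
import numpy as np

def gibbs_bivariate_gaussian(r, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate Gaussian with correlation r."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                      # initial state x^0
    sd = np.sqrt(1.0 - r ** 2)             # conditional standard deviation
    out = np.empty((n_samples, 2))
    for k in range(burn_in + n_samples):
        x1 = rng.normal(r * x2, sd)        # pick x1 from P(x1 | x2)
        x2 = rng.normal(r * x1, sd)        # pick x2 from P(x2 | new x1)
        if k >= burn_in:
            out[k - burn_in] = (x1, x2)
    return out

samples = gibbs_bivariate_gaussian(0.8, 20000)
empirical_corr = np.corrcoef(samples.T)[0, 1]
```

Successive samples are correlated, more strongly as r approaches 1, which is one concrete form of the "how many iterations per sample?" question.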



Particle Filters 



Assume you have $n$ weighted samples $S_{t-1} = \{s_{t-1}^{(1)}, \ldots, s_{t-1}^{(n)}\}$ from
$P(x_{t-1} \mid Z_{t-1})$, with normalised weights $\pi_{t-1}^{(i)}$

1. generate a new sample set $S'$ by sampling with replacement
from $S_{t-1}$ with probabilities proportional to $\pi_{t-1}^{(i)}$

2. for each element of $S'$ predict, using the stochastic dynamics,
by sampling $s_t^{(i)}$ from $P(x_t \mid x_{t-1} = s'^{(i)})$

3. Using the measurement model, weight each new sample by

$$\pi_t^{(i)} = P(z_t \mid x_t = s_t^{(i)})$$

(the likelihoods) and normalise so that $\sum_i \pi_t^{(i)} = 1$.

Samples need to be weighted by the ratio of the true posterior to the
distribution we draw them from (this is importance sampling).

An easy way to do that is draw from prior and weight by likelihood. 
(Also known as Condensation algorithm.) 
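
The three steps can be sketched for a scalar toy model. The random-walk dynamics, the Gaussian measurement noise, and the values of q and r below are assumptions of this example and do not come from the original slides.

```python
import numpy as np

def particle_filter_step(samples, weights, z, rng, q=0.1, r=0.5):
    """One update for the assumed model
    x_t = x_{t-1} + w_t, w_t ~ N(0, q);  z_t = x_t + v_t, v_t ~ N(0, r)."""
    n = len(samples)
    # 1. resample with replacement, in proportion to the previous weights
    idx = rng.choice(n, size=n, p=weights)
    # 2. predict each resampled element through the stochastic dynamics
    predicted = samples[idx] + rng.normal(0.0, np.sqrt(q), size=n)
    # 3. weight by the measurement likelihood and normalise
    lik = np.exp(-0.5 * (z - predicted) ** 2 / r)
    return predicted, lik / lik.sum()

rng = np.random.default_rng(0)
n = 2000
samples = rng.normal(0.0, 1.0, size=n)     # draws from an assumed N(0, 1) prior
weights = np.full(n, 1.0 / n)
for _ in range(30):                        # feed the same measurement repeatedly
    samples, weights = particle_filter_step(samples, weights, z=2.0, rng=rng)
posterior_mean = float(np.sum(weights * samples))
```

Here the proposal is the prior dynamics, so the importance weight reduces to the likelihood, as described above.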

































Linearization 


Extended Kalman Filtering and Smoothing 



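$$x_{t+1} = f(x_t, u_t) + w_t$$

$$y_t = g(x_t, u_t) + v_t$$


Linearise about the current estimate, i.e. given $\hat{x}_t$, $u_t$:


$$x_{t+1} \approx f(\hat{x}_t, u_t) + \left. \frac{\partial f}{\partial x_t} \right|_{\hat{x}_t} (x_t - \hat{x}_t) + w_t$$

$$y_t \approx g(\hat{x}_t, u_t) + \left. \frac{\partial g}{\partial x_t} \right|_{\hat{x}_t} (x_t - \hat{x}_t) + v_t$$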


Run the Kalman smoother (belief propagation for 
linear-Gaussian systems) on the linearised system. This 
approximates non-Gaussian posterior by a Gaussian. 






















Recognition Models 



a function approximator is trained in a supervised 
way to recover the hidden causes (latent variables) 
from the observations 

this may take the form of explicit recognition network 
(e.g. Helmholtz machine) which mirrors the 
generative network (tractability at the cost of 
restricted approximating distribution) 


inference is done in a single bottom-up pass 
(no iteration required) 


Variational Inference 


Goal: maximise $\ln P(Y \mid \theta)$.

Any distribution $Q(X)$ over the hidden variables defines a
lower bound on $\ln P(Y \mid \theta)$:

$$\ln P(Y \mid \theta) \geq \sum_X Q(X) \ln \frac{P(X, Y \mid \theta)}{Q(X)} = \mathcal{F}(Q, \theta)$$

Constrain $Q(X)$ to be of a particular tractable form (e.g.
factorised) and maximise $\mathcal{F}$ subject to this constraint

• E-step: Maximise $\mathcal{F}$ w.r.t. $Q$ with $\theta$ fixed, subject to
the constraint on $Q$, equivalently minimise:

$$\ln P(Y \mid \theta) - \mathcal{F}(Q, \theta) = \sum_X Q(X) \ln \frac{Q(X)}{P(X \mid Y, \theta)} = \mathrm{KL}(Q \,\|\, P)$$

The inference step therefore tries to find $Q$ closest to
the exact posterior distribution.

• M-step: Maximise $\mathcal{F}$ w.r.t. $\theta$ with $Q$ fixed
(related to mean-field approximations)




Beyond Maximum Likelihood: 

Finding Model Structure and Avoiding Overfitting 



[Figure: three panels, M = 4, M = 5, M = 6]














Model Selection Questions 


How many clusters in this data set? 

What is the intrinsic dimensionality of the data? 
What is the order of my autoregressive process? 
How many sources in my ICA model? 

How many states in my HMM? 

Is this input relevant to predicting that output? 

Is this relationship linear or nonlinear? 



Bayesian Learning and Ockham’s Razor 


data $Y$, models $M_1, \ldots, M_n$, parameter sets $\theta_1, \ldots, \theta_n$

(let’s ignore hidden variables $X$ for the moment; they will just
introduce another level of averaging/integration)


Total Evidence:

$$P(Y) = \sum_j P(Y \mid M_j)\, P(M_j)$$

Model Selection:

$$P(M_i \mid Y) = \frac{P(Y \mid M_i)\, P(M_i)}{P(Y)}$$

$$P(Y \mid M_i) = \int P(Y \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i$$



• P(Y\Mi) is the probability that randomly selected 
parameter values from model class Mi would 
generate data set Y. 

• Model classes that are too simple will be very 
unlikely to generate that particular data set. 

• Model classes that are too complex can generate 
many possible data sets, so again, they are unlikely 
to generate that particular data set at random. 



Ockham’s Razor



[Figure: P(Data) plotted over Data Sets; adapted from D.J.C. MacKay]








Overfitting 


[Figure: six panels, M = 1 to M = 6]


Practical Bayesian Approaches


Laplace approximations 

Large sample approximations (e.g. BIC) 

Markov chain Monte Carlo methods 


Variational approximations 



Laplace Approximation 


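data set $Y$, models $M_1, \ldots, M_n$, parameter sets $\theta_1, \ldots, \theta_n$


Model Selection:

$$P(M_i \mid Y) \propto P(M_i)\, P(Y \mid M_i)$$

For large amounts of data (relative to number of
parameters, $d$) the parameter posterior is approximately
Gaussian around the MAP estimate $\hat{\theta}_i$

$$P(\theta_i \mid Y, M_i) \approx (2\pi)^{-d/2}\, |A|^{1/2} \exp\left\{ -\tfrac{1}{2} (\theta_i - \hat{\theta}_i)^\top A\, (\theta_i - \hat{\theta}_i) \right\}$$


$$P(Y \mid M_i) = \frac{P(\theta_i, Y \mid M_i)}{P(\theta_i \mid Y, M_i)}$$


Evaluating the above expression for $\ln P(Y \mid M_i)$ at $\hat{\theta}_i$

$$\ln P(Y \mid M_i) \approx \ln P(\hat{\theta}_i \mid M_i) + \ln P(Y \mid \hat{\theta}_i, M_i) + \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$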

where A is the negative Hessian of the log posterior. 


This can be used for model selection. 
(Note: A is size d x d.) 



BIC 


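The Bayesian Information Criterion (BIC) can be obtained
from the Laplace approximation

$$\ln P(Y \mid M_i) \approx \ln P(\hat{\theta}_i \mid M_i) + \ln P(Y \mid \hat{\theta}_i, M_i) + \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$

by taking the large sample limit:

$$\ln P(Y \mid M_i) \approx \ln P(Y \mid \hat{\theta}_i, M_i) - \frac{d}{2} \ln N$$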

where N is the number of data points. 

Properties: 

• Quick and easy to compute 

• It does not depend on the prior 

• We can use the ML estimate of 9 instead of the MAP 
estimate 

• It assumes that in the large sample limit, all the 
parameters are well-determined (i.e. the model is 
identifiable; otherwise, d should be the number of 
well-determined parameters) 

• It is equivalent to the MDL criterion 
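
A short sketch of BIC used for model selection, assuming polynomial regression with Gaussian noise as the family of models. The data, the function name and the parameter count d (polynomial coefficients plus one noise variance) are choices made for this illustration.

```python
import numpy as np

def gaussian_poly_bic(x, y, degree):
    """BIC score ln P(Y | theta_ML, M) - (d / 2) ln N for a polynomial
    of the given degree with Gaussian noise of unknown variance."""
    N = len(x)
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)                        # ML noise variance
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1.0)
    d = degree + 2                                      # coefficients + variance
    return log_lik - 0.5 * d * np.log(N)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.standard_normal(100)  # quadratic truth

scores = {deg: gaussian_poly_bic(x, y, deg) for deg in range(6)}
best_degree = max(scores, key=scores.get)
```

Degrees below 2 lose heavily on the likelihood term, while higher degrees gain little likelihood and pay the (d/2) ln N penalty.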


MCMC 


Assume a model with parameters $\theta$, hidden variables $X$
and observable variables $Y$

Goal: to obtain samples from the (intractable) posterior
distribution over the parameters, $P(\theta \mid Y)$

Approach: to sample from a Markov chain whose
equilibrium distribution is $P(\theta \mid Y)$.

One such simple Markov chain can be obtained by Gibbs
sampling, which alternates between:

• Step A: Sample from parameters given hidden
variables and observables: $\theta \sim P(\theta \mid X, Y)$

• Step B: Sample from hidden variables given
parameters and observables: $X \sim P(X \mid \theta, Y)$

Note the similarity to the EM algorithm! 



Variational Bayesian Learning 


Lower bound the evidence: 


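$$\mathcal{L} = \ln P(Y) = \ln \int P(Y, \theta)\, d\theta$$

$$\geq \int Q(\theta) \left[ \ln P(Y \mid \theta) + \ln \frac{P(\theta)}{Q(\theta)} \right] d\theta$$

$$\geq \int Q(\theta) \left[ \sum_X Q(X) \ln \frac{P(Y, X \mid \theta)}{Q(X)} + \ln \frac{P(\theta)}{Q(\theta)} \right] d\theta$$

$$= \mathcal{F}(Q(\theta), Q(X))$$


Assumes the factorisation:

$$P(\theta, X \mid Y) \approx Q(\theta)\, Q(X)$$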


(also known as “ensemble learning”) 












Variational Bayesian Learning 


EM-like optimisation: 

“E-step”: Maximise $\mathcal{F}$ w.r.t. $Q(X)$ with $Q(\theta)$ fixed

“M-step”: Maximise $\mathcal{F}$ w.r.t. $Q(\theta)$ with $Q(X)$ fixed

Finds an approximation to the posterior over parameters
$Q(\theta) \approx P(\theta \mid Y)$ and hidden variables $Q(X) \approx P(X \mid Y)$

• Maximises a lower bound on the log evidence

• Convergence can be assessed by monitoring $\mathcal{F}$

• Global approximation

• $\mathcal{F}$ transparently incorporates model complexity
penalty (i.e. coding cost for all the parameters of the
model) so it can be compared across models

• Optimal form of $Q(\theta)$ falls out of free-form variational
optimisation (i.e. not assumed to be Gaussian)

• Often simple modification of the EM algorithm 



Summary 


Why probabilistic models? 

Factor analysis and beyond 
Inference and the EM algorithm 
Generative Model for Generative Models 
A few models in detail 
Approximate inference 
Practical Bayesian approaches 



Appendix 



Desiderata (or Axioms) for 
Computing Plausibilities 


Paraphrased from E.T. Jaynes, using the notation p(A\B) 
is the plausibility of statement A given that you know that 
statement B is true. 

• Degrees of plausibility are represented by real 
numbers 

• Qualitative correspondence with common sense, e.g. 

- If $p(A \mid C') > p(A \mid C)$ but $p(B \mid A \,\&\, C') = p(B \mid A \,\&\, C)$
then $p(A \,\&\, B \mid C') \geq p(A \,\&\, B \mid C)$

• Consistency: 

- If a conclusion can be reasoned in more than one 
way, then every possible way must lead to the 
same result. 

- All available evidence should be taken into 
account when inferring a plausibility. 

- Equivalent states of knowledge should be 
represented with equivalent plausibility 
statements. 

Accepting these desiderata leads to Bayes Rule being 
the only way to manipulate plausibilities. 


Learning with Complete Data 



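Assume a data set of i.i.d. observations
$\mathcal{D} = \{\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(n)}\}$ and a parameter vector $\theta$.

Goal is to maximise likelihood:

$$P(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} P(\mathbf{y}^{(i)} \mid \theta)$$

Equivalently, maximise log likelihood:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln P(\mathbf{y}^{(i)} \mid \theta)$$

Using the graphical model factorisation:

$$P(\mathbf{y}^{(i)} \mid \theta) = \prod_j P(y_j^{(i)} \mid \mathbf{y}_{\mathrm{pa}(j)}^{(i)}, \theta_j)$$

So:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \sum_j \ln P(y_j^{(i)} \mid \mathbf{y}_{\mathrm{pa}(j)}^{(i)}, \theta_j)$$

In other words, the parameter estimation problem breaks
into many independent, local problems (uncoupled).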







Building a Junction Tree 


Start with the recursive factorization from the DAG: 

$$P(\mathbf{X}) = \prod_i P(x_i \mid \mathrm{pa}(x_i))$$

Convert these local conditional probabilities into
potential functions over both $x_i$ and all its parents.

This is called moralising the DAG since the parents
get connected. Now the product of the potential
functions gives the correct joint.

When evidence is absorbed, potential functions must 
agree on the prob. of shared variables: consistency. 

This can be achieved by passing messages between 
potential functions to do local marginalising and 
rescaling. 

Problem: a variable may appear in two 
non-neighbouring cliques. To avoid this we need to 
triangulate the original graph to give the potential 
functions the running intersection property. 

Now local consistency will imply global consistency. 



Bayesian Networks: Belief Propagation 




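Each node $n$ divides the evidence, $e$, in the graph into
two disjoint sets: $e^+(n)$ and $e^-(n)$

Assume a node $n$ with parents $\{p_1, \ldots, p_k\}$ and
children $\{c_1, \ldots, c_\ell\}$

$$P(n \mid e) \propto \left[ \sum_{\{p_1, \ldots, p_k\}} P(n \mid p_1, \ldots, p_k) \prod_{i=1}^{k} P(p_i \mid e^+(p_i)) \right] \left[ \prod_{j=1}^{\ell} P(c_j, e^-(c_j) \mid n) \right]$$

$$\propto P(n \mid e^+(n))\, P(e^-(n) \mid n)$$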








ICA Nonlinearity 


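Generative model:


$$\mathbf{x} = g(\mathbf{w})$$

$$\mathbf{y} = C\mathbf{x} + \mathbf{v}$$

where $\mathbf{w}$ and $\mathbf{v}$ are zero-mean Gaussian noises
with covariances $I$ and $R$ respectively.



The density of $x$ can be written in terms of $g(\cdot)$,

$$p_x(x) = \frac{\mathcal{N}(0, 1)\big|_{g^{-1}(x)}}{|g'(g^{-1}(x))|}$$

For example, if $p_x(x) = \frac{1}{\pi \cosh(x)}$ we find that setting:

$$g(w) = \ln\left( \tan\left( \frac{\pi}{4} \left( 1 + \mathrm{erf}(w / \sqrt{2}) \right) \right) \right)$$

generates vectors $\mathbf{x}$ in which each component is
distributed exactly according to $1/(\pi \cosh(x))$.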


So, ICA can be seen either as a linear generative model 
with non-Gaussian priors for the hidden variables, or as a 
nonlinear generative model with Gaussian priors for the 
hidden variables. 
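
The claim about the nonlinearity can be checked by matching cumulative distributions: if w is standard Gaussian and x = g(w), then the CDF of 1/(pi cosh x), which is (2/pi) arctan(exp(x)), evaluated at g(w) should equal the Gaussian CDF at w. The sketch below does this on a grid; the function names and the grid are illustrative choices.

```python
import math
import numpy as np

def g(w):
    """g(w) = ln tan( (pi / 4) (1 + erf(w / sqrt(2))) )."""
    return math.log(math.tan(0.25 * math.pi * (1.0 + math.erf(w / math.sqrt(2.0)))))

def cdf_x(x):
    """CDF of the density 1 / (pi cosh x): (2 / pi) arctan(exp(x))."""
    return 2.0 / math.pi * math.atan(math.exp(x))

def cdf_w(w):
    """CDF of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))

ws = np.linspace(-3.0, 3.0, 13)
max_gap = max(abs(cdf_x(g(w)) - cdf_w(w)) for w in ws)
```

Because g is monotonically increasing, agreement of the two CDFs at every w is equivalent to x = g(w) having density 1/(pi cosh x).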




HMM Example

• Character sequences (discrete outputs)

• Geyser data (continuous outputs)

[Figure: state output functions]


LDS Example


• Population model:
state ↔ population histogram
first row of A ↔ birthrates
subdiagonal of A ↔ 1 − deathrates
Q ↔ immigration/emigration
C ↔ noisy indicators

[Figure: true population and inferred population]


Viterbi Decoding


The numbers $\gamma_j(t)$ in forward-backward gave the
posterior probability distribution over all states at any
time.


By choosing the state $\gamma_*(t)$ with the largest
probability at each time, we can make a “best” state
path. This is the path with the
maximum expected number of correct states.


But it is not the single path with the highest likelihood 
of generating the data. 

In fact it may be a path of probability zero! 


To find the single best path, we do Viterbi decoding 
which is just Bellman's dynamic programming 
algorithm applied to this problem. 


The recursions look the same, except with
$\max$ instead of $\sum$.


There is also a modified Baum-Welch training based 
on the Viterbi decode. 
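
A minimal sketch of Viterbi decoding in log space, assuming the output likelihoods have already been evaluated into an array q with q[t, j] equal to the probability of the observation at time t under state j. The function name and array layout are choices made for this example.

```python
import numpy as np

def viterbi(pi, T, q):
    """Single most probable state path for an HMM.
    pi[j]   = P(x_1 = j)
    T[i, j] = P(x_{t+1} = j | x_t = i)
    q[t, j] = likelihood of the observation at time t under state j
    Returns the path and its joint log probability with the data."""
    tau, S = q.shape
    with np.errstate(divide="ignore"):          # log(0) = -inf is acceptable here
        lpi, lT, lq = np.log(pi), np.log(T), np.log(q)
    delta = lpi + lq[0]                         # best log score ending in each state
    back = np.zeros((tau, S), dtype=int)
    for t in range(1, tau):
        scores = delta[:, None] + lT            # scores[i, j]: best path into i, then i -> j
        back[t] = scores.argmax(axis=0)         # max replaces the forward-pass sum
        delta = scores.max(axis=0) + lq[t]
    path = np.zeros(tau, dtype=int)
    path[-1] = delta.argmax()
    for t in range(tau - 1, 0, -1):             # trace the back-pointers
        path[t - 1] = back[t, path[t]]
    return path, float(delta.max())
```

Unlike picking the most probable state independently at each time, the returned path is always one the model can actually generate.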



HMM Pseudocode 


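• Forward-backward including scaling tricks

$q_j(t) = A_j(y_t)$

$\alpha(1) = \pi \;.\!*\; q(1) \qquad \rho(1) = \sum \alpha(1) \qquad \alpha(1) = \alpha(1)/\rho(1)$

$\alpha(t) = (T' * \alpha(t-1)) \;.\!*\; q(t) \qquad \rho(t) = \sum \alpha(t) \qquad \alpha(t) = \alpha(t)/\rho(t) \qquad [t = 2 : \tau]$

$\beta(\tau) = 1$

$\beta(t) = T * (\beta(t+1) \;.\!*\; q(t+1)) / \rho(t+1) \qquad [t = (\tau - 1) : 1]$

$\xi = 0$

$\xi = \xi + T \;.\!*\; (\alpha(t) * (\beta(t+1) \;.\!*\; q(t+1))') / \rho(t+1) \qquad [t = 1 : (\tau - 1)]$

$\gamma = (\alpha \;.\!*\; \beta)$

$\log P(y_1^\tau) = \sum_t \log(\rho(t))$

• Baum-Welch parameter updates

$\delta_j = 0 \qquad T_{ij} = 0 \qquad \pi = 0 \qquad A_j = 0$

for each sequence, run forward backward to get $\gamma$ and $\xi$, then

$T = T + \xi \qquad \pi = \pi + \gamma(1) \qquad \delta = \delta + \sum_t \gamma(t)$

$A_j(y) = \sum_{t \mid y_t = y} \gamma_j(t)$ or $A_j = A_j + \sum_t y_t \gamma_j(t)$

$T_{ij} = T_{ij} / \sum_k T_{ik} \qquad \pi = \pi / \sum \pi \qquad A_j = A_j / \delta_j$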
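
The scaled forward-backward recursions translate fairly directly into NumPy. This is a sketch under the assumption that the output likelihoods are precomputed in an array q with q[t, j] = A_j(y_t); the function name and zero-based time indexing are choices of this example.

```python
import numpy as np

def forward_backward(pi, T, q):
    """Scaled forward-backward pass for one sequence.
    pi[j]   = P(x_1 = j)
    T[i, j] = P(x_{t+1} = j | x_t = i)
    q[t, j] = A_j(y_t)
    Returns gamma (state posteriors), xi (summed pairwise posteriors)
    and the log likelihood of the sequence."""
    tau, S = q.shape
    alpha = np.zeros((tau, S))
    beta = np.ones((tau, S))                    # beta at the final time is 1
    rho = np.zeros(tau)
    alpha[0] = pi * q[0]
    rho[0] = alpha[0].sum()
    alpha[0] /= rho[0]
    for t in range(1, tau):                     # forward pass with rescaling
        alpha[t] = (T.T @ alpha[t - 1]) * q[t]
        rho[t] = alpha[t].sum()
        alpha[t] /= rho[t]
    xi = np.zeros((S, S))
    for t in range(tau - 2, -1, -1):            # backward pass, same scale factors
        b = beta[t + 1] * q[t + 1]
        beta[t] = T @ b / rho[t + 1]
        xi += T * np.outer(alpha[t], b) / rho[t + 1]
    gamma = alpha * beta
    return gamma, xi, float(np.log(rho).sum())
```

Each row of gamma sums to one, and the entries of xi sum to the number of transitions in the sequence, which is what the Baum-Welch updates rely on when they normalise the accumulated counts.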



LDS Pseudocode 


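• Kalman filter/smoother including scaling tricks

$x^+ = x_0 \qquad V^+ = V_0$

$\rho(t) = \mathcal{N}(C x^+,\; C V^+ C' + R)\big|_{y_t} \qquad [t = 1 : \tau]$

$K = V^+ C' (C V^+ C' + R)^{-1}$

$\hat{x}_t = x^+ + K (y_t - C x^+)$

$\hat{V}_t = (I - K C) V^+$

$x^+ = A \hat{x}_t$

$V^+ = A \hat{V}_t A' + Q$

$\tilde{x}_\tau = \hat{x}_\tau \qquad \tilde{V}_\tau = \hat{V}_\tau$

$V^+ = A \hat{V}_t A' + Q \qquad [t = (\tau - 1) : 1]$

$J = \hat{V}_t A' (V^+)^{-1}$

$\tilde{x}_t = \hat{x}_t + J (\tilde{x}_{t+1} - A \hat{x}_t)$

$\tilde{V}_t = \hat{V}_t + J (\tilde{V}_{t+1} - V^+) J'$


$\log P(y_1^\tau) = \sum_t \log(\rho(t))$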


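• EM parameter updates

$\delta = 0 \qquad \beta = 0 \qquad \gamma = 0 \qquad \alpha = 0 \qquad x_0 = 0$


for each sequence, run Kalman smoother to get $\tilde{x}_t$, $\hat{x}_t$, $\tilde{V}_t$ and $\hat{V}_t$

$x_0 = x_0 + \tilde{x}_1 / N \qquad V_{t,t-1} = \tilde{V}_t J_{t-1}'$

$\alpha = \alpha + \sum_t y_t y_t' \qquad \delta = \delta + \sum_t y_t \tilde{x}_t'$

$\gamma = \gamma + \sum_t \tilde{x}_t \tilde{x}_t' + \tilde{V}_t$

$\beta = \beta + \sum_{t=2} \tilde{x}_t \tilde{x}_{t-1}' + V_{t,t-1}$

$\gamma_\tau = \gamma_\tau + \tilde{x}_\tau \tilde{x}_\tau' + \tilde{V}_\tau$

$\gamma_1 = \gamma_1 + \tilde{x}_1 \tilde{x}_1' + \tilde{V}_1$


$C = \delta \gamma^{-1}$
$R = (\alpha - C \delta') / \tau_N$


$A = \beta (\gamma - \gamma_\tau)^{-1}$
$Q = (\gamma - \gamma_1 - A \beta') / (\tau_N - N)$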
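
The filter and smoother recursions can be sketched in NumPy as follows. The function name, argument order and zero-based time indexing are assumptions of this example; x0 and V0 are taken to be the prior mean and covariance of the first hidden state, and the EM parameter updates are not included.

```python
import numpy as np

def kalman_smoother(Y, A, C, Q, R, x0, V0):
    """Kalman filter followed by the backward smoothing pass.
    Y has shape (tau, D). Returns filtered means and covariances,
    smoothed means and covariances, and log P(y_1 .. y_tau)."""
    tau, K = Y.shape[0], A.shape[0]
    xf = np.zeros((tau, K))
    Vf = np.zeros((tau, K, K))
    xp, Vp = x0, V0                              # predicted mean / covariance
    loglik = 0.0
    for t in range(tau):
        S = C @ Vp @ C.T + R                     # covariance of predicted output
        err = Y[t] - C @ xp
        loglik += -0.5 * (err @ np.linalg.solve(S, err)
                          + np.linalg.slogdet(2 * np.pi * S)[1])
        Kg = Vp @ C.T @ np.linalg.inv(S)         # Kalman gain
        xf[t] = xp + Kg @ err
        Vf[t] = (np.eye(K) - Kg @ C) @ Vp
        xp, Vp = A @ xf[t], A @ Vf[t] @ A.T + Q  # one-step prediction
    xs, Vs = xf.copy(), Vf.copy()                # last smoothed = last filtered
    for t in range(tau - 2, -1, -1):
        Vp = A @ Vf[t] @ A.T + Q
        J = Vf[t] @ A.T @ np.linalg.inv(Vp)
        xs[t] = xf[t] + J @ (xs[t + 1] - A @ xf[t])
        Vs[t] = Vf[t] + J @ (Vs[t + 1] - Vp) @ J.T
    return xf, Vf, xs, Vs, loglik
```

The accumulated log likelihood is the sum of log predictive densities, so it can be compared against a direct evaluation of the joint Gaussian over all outputs on a short sequence.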